fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref#4362
Open
tp5uiuc wants to merge 3 commits into
Open
fix(runtime): autosave engine-implicit RuntimeCache via atexit + weakref#4362tp5uiuc wants to merge 3 commits into
tp5uiuc wants to merge 3 commits into
Conversation
6d02dad to
a0a67d7
Compare
a0a67d7 to
a3770be
Compare
The previous `__del__`-only autosave path silently lost cache updates when the engine survived until interpreter exit (typical for inference servers). Python tears down `sys.meta_path` early in shutdown; the torchbind attribute access in `self._handle.path` and the lazy `filelock` import inside `save()` then raise `ImportError: sys.meta_path is None`, which escaped `__del__` and surfaced as a noisy `Exception ignored in __del__` once per surviving handle. `atexit` callbacks run *before* module teardown, so the torchbind path and lazy imports still resolve there. Register an `atexit` hook from `__init__` whenever `autosave_on_del=True`. The hook closes over a `weakref.ref(self)` so it doesn't pin the handle alive: a handle that dies mid-program still goes through `__del__`, and the atexit hook later sees a dead weakref and no-ops. Other design points worth calling out: * `_autosave_at_exit` is a module-level helper, not a bound method. A bound method captures `self` via `__self__`, which would defeat the weakref. The free function lets the closure carry only the weakref. * Both `__del__` and the atexit hook flip `autosave_on_del` off before saving so whichever path runs first wins and the other no-ops -- no double-save, no double-leak risk. * `__del__` unregisters its atexit token. Without this, a long-running process that churns engine-implicit handles (model swaps, A/B rollouts) accumulates dead atexit entries -- small per entry but unbounded. * The `try` in `__del__` still wraps the whole body so any residual attribute-access failure during late-shutdown corner cases is swallowed rather than leaking to `sys.unraisablehook`. Tests added in `TestRuntimeCacheAutosave`: - `test_del_swallows_shutdown_import_error_on_path`: monkey-patches `_handle.path` to raise the shutdown `ImportError`; asserts via `sys.unraisablehook` that nothing leaks. - `test_atexit_hook_saves_via_weakref`: exercises the helper directly, verifies it saves and flips `autosave_on_del`. - `test_atexit_hook_no_op_on_dead_weakref`: dead weakref => no-op, no exception. - `test_atexit_token_unregistered_after_del`: `atexit.unregister` is spied to confirm `__del__` cleaned up. Refs pytorch#4359
a3770be to
eacb80e
Compare
tp5uiuc
commented
Jun 26, 2026
tp5uiuc
commented
Jun 26, 2026
| except Exception: | ||
| pass | ||
|
|
||
| def __getstate__(self) -> dict[str, Any]: |
Collaborator
Author
There was a problem hiding this comment.
Note to reviewers : these are added so that users calling pickle.save(RuntimeCache()) works as expected.
The atexit + weakref design from the previous commit added an ``_atexit_token`` slot (``partial(_autosave_at_exit, weakref.ref(self))``) to ``RuntimeCache``. ``weakref.ref`` is unpicklable, so any caller that pickled a ``RuntimeCache`` directly hit ``TypeError: cannot pickle 'weakref.ReferenceType'``. Two defenses: * ``__getstate__`` / ``__setstate__`` strip the token from the pickle stream and re-register a fresh atexit hook on unpickle when ``autosave_on_del`` was on. Atexit registration is a per-process artifact; the loading process gets its own hook bound to the loading- side instance. * ``__del__`` reads ``_atexit_token`` via ``getattr(..., None)``. ``__init__`` sets the slot first, but ``copy.deepcopy`` can crash mid-state-copy on an unrelated field (e.g. the python-runtime ``_RuntimeCacheHandle._lock``, which is ``threading.Lock``) and leave the new instance with only some attributes set. The defensive read keeps ``__del__`` quiet rather than raising ``AttributeError`` to ``sys.unraisablehook``. Test ``test_pickle_round_trip_strips_atexit_token`` exercises the standalone-pickle path with a ``SimpleNamespace`` stub for ``_handle`` so the test isolates the ``__getstate__``/``__setstate__`` logic from the orthogonal limitation that ``_RuntimeCacheHandle`` (python rt) is itself unpicklable today. The wrapper-side fix that prevents engine-implicit handles from being pickled via ``torch.save(module)`` at all -- plus the matching ``_RuntimeCacheHandle`` picklability fix on python-rt -- is intentionally a separate change; this commit only addresses pickle safety of the ``_atexit_token`` slot introduced in the prior commit. Refs pytorch#4359
78635f9 to
fbd48ce
Compare
3 tasks
…helper ``__init__`` and ``__setstate__`` both registered the same ``atexit.register(partial(_autosave_at_exit, weakref.ref(self)))`` call. Hoist into ``_register_atexit_autosave`` and reuse from both sites. The idempotency check (``_atexit_token is None``) moves into the helper, so a second call is a no-op regardless of caller. Behavior is unchanged. Addresses review comment on py/torch_tensorrt/runtime/_runtime_cache.py:227.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Original bug from #4359: engine-implicit
RuntimeCachehandles relied solely on__del__to autosave at process exit. When the engine survived to exit,__del__fired during shutdown and the torchbind /filelocklazy imports raisedImportError: sys.meta_path is None, surfacing asException ignored in __del__once per surviving handle.Fixes
atexit.register(partial(_autosave_at_exit, weakref.ref(self))).atexitfires before module teardown;__del__stays as the mid-program GC path. Both flip
autosave_on_deloff sowhichever runs first wins.
__del__unregisters its token to keep theatexit registry from accumulating dead entries.
weakrefkeeps the registrationnon-owning, so handles can still die mid-program normally.
RuntimeCache.__getstate__/__setstate__strip the unpicklableweakreffrom the pickle stream and re-register a fresh atexit hookon unpickle when
autosave_on_delwas on.__del__reads the tokenvia
getattr(..., None)socopy.deepcopypartial-init edge casesstay silent.
Dependency on a sibling PR
This PR introduces an
_atexit_tokenslot (partial(..., weakref.ref(self))) onRuntimeCache. CI tests that pickle compiled modules (test_mutable_torchtrt_module::test_save,test_cross_runtime_serde::test_save_*) walk through_runtime_settings.runtime_cacheand would hit the weakref. The matching wrapper-side pickle exclusion lives in #4368 and should land first to unblock CI here.Fixes #4359.
Type of change
Test plan
TestRuntimeCacheAutosaveintests/py/dynamo/runtime/test_000_runtime_cache.py:test_del_swallows_shutdown_import_error_on_path—sys.unraisablehooksees no leaktest_atexit_hook_saves_via_weakref— helper saves + flips flagtest_atexit_hook_no_op_on_dead_weakref— dead weakref no-ops cleanlytest_atexit_token_unregistered_after_del—atexit.unregistercalled with the registered tokentest_pickle_round_trip_strips_atexit_token— standalone-pickle round-trip rewires a fresh hook (stubbed handle to isolate from the sibling PR's_RuntimeCacheHandlepicklability work)pre-commit runclean on touched files (mypy skipped for two pre-existing errors at_TorchTensorRTModule.py:380and:508, unrelated)